Output Driven Distributed Optimistic Message Logging and Checkpointing

Authors

  • David B Johnson
  • Willy Zwaenepoel
Abstract

Although optimistic fault-tolerance methods using message logging and checkpointing have the potential to provide highly efficient, transparent fault tolerance in distributed systems, existing methods are limited by several factors. Coordinating the asynchronous message logging progress among all processes of the system may cause significant overhead, limiting their ability to scale to large systems and offsetting some of the performance gains over simpler pessimistic methods. Furthermore, logging all messages received by each process may place a substantial load on the network and file server in systems with high communication rates. Finally, existing methods do not support nondeterministic process execution, such as occurs in multithreaded processes and those that handle asynchronous interrupts. This paper presents a new method using optimistic message logging and checkpointing that addresses these limitations. Any fault-tolerance method must delay output from the system to the outside world until it can guarantee that no future failure can force the system to roll back to a state before the output was sent. With this new method, only this need to commit output forces any process to log received messages or to checkpoint. Each process commits its own output with the cooperation of the minimum number of other processes, and any messages not needed to allow pending output to be committed need not be logged. Individual processes may also dynamically switch to checkpointing without message logging, to avoid the expense of logging a large number of messages or to support their own nondeterministic execution.
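
The output-commit idea in the abstract can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the paper's algorithm: the `Process` class, its `deps` dependency map, the `stable` counter, and the `commit_output` function are all hypothetical names introduced here. The sketch assumes each message delivery starts a new state interval and that a process inherits the dependencies of the sender. It shows only that committing output needs to contact the processes the output actually depends on, and that unrelated messages stay unlogged.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    # All fields are illustrative assumptions, not taken from the paper.
    pid: int
    interval: int = 0   # index of the current state interval
    stable: int = 0     # highest interval recoverable from stable storage
    deps: dict = field(default_factory=dict)  # pid -> highest interval depended on
    volatile_log: list = field(default_factory=list)

    def receive(self, sender):
        """Deliver a message: start a new interval and inherit sender dependencies."""
        self.interval += 1
        self.volatile_log.append((sender.pid, sender.interval))
        inherited = {**sender.deps, sender.pid: sender.interval}
        for pid, idx in inherited.items():
            self.deps[pid] = max(self.deps.get(pid, 0), idx)

    def make_stable(self, upto):
        """Log messages (or checkpoint) so that intervals <= upto survive a failure."""
        self.stable = max(self.stable, min(upto, self.interval))


def commit_output(proc, processes):
    """Make stable only what the pending output depends on.

    Returns the sorted pids of the processes that had to log or checkpoint.
    """
    needed = {**proc.deps, proc.pid: proc.interval}
    contacted = []
    for pid, idx in needed.items():
        target = processes[pid]
        if target.stable < idx:
            target.make_stable(idx)
            contacted.append(pid)
    return sorted(contacted)


# Four processes; process 3 receives a message but never affects process 1.
procs = {i: Process(i) for i in range(4)}
procs[0].receive(procs[2])   # 2 -> 0
procs[1].receive(procs[0])   # 0 -> 1
procs[3].receive(procs[2])   # 2 -> 3, unrelated to process 1's output

first = commit_output(procs[1], procs)
second = commit_output(procs[1], procs)
```

In this run, process 1's output depends on its own interval 1 and on interval 1 of process 0, so only those two processes are asked to make state stable. Process 3's received message is never logged, and a second commit with no new messages contacts no one.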

Similar resources

Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs

Existing rollback-recovery methods using consistent checkpointing may cause high overhead for applications that frequently send output to the “outside world,” since a new consistent checkpoint must be written before the output can be committed, whereas existing methods using optimistic message logging may cause large delays in committing output, since processes may buffer received messages arbi...

Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit

Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, nam...

Improving Message Logging Protocols Scalability through Distributed Event Logging

Message logging is an attractive solution to provide fault tolerance for message-passing applications because it is more scalable than coordinated checkpointing. Sender-based message logging is a well-known optimization that allows message payloads to be saved in the sender's memory, so that only the events corresponding to message receptions have to be logged reliably using an event logger. In existin...

Optimistic Message Logging for Independent Checkpointing in Message-Passing Systems

Message-passing systems with communication protocol transparent to the applications typically require message logging to ensure consistency between checkpoints. This paper describes a periodic independent checkpointing scheme with optimistic logging to reduce performance degradation during normal execution while keeping the recovery cost acceptable. Both time and space overhead for message logg...

Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing

In a distributed system using message logging and checkpointing to provide fault tolerance, there is always a unique maximum recoverable system state, regardless of the message logging protocol used. The proof of this relies on the observation that the set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system ...

Publication date: 1990